6. Machine Learning and Causal Inference

Big data in Economics

Juan D. Montoro-Pons | 2024/25

Machine learning applications: beyond prediction

 

Can we use machine learning predictive models to obtain unbiased estimates of causal effects, with valid confidence intervals?

  • Metalearners
  • Machine learning models designed for learning causal effects (e.g. causal trees and causal forest)
  • Double machine learning
  • Synthetic controls
  • Matrix completion methods

Why machine learning?

 

  • Regression models in a high-dimensional feature space
  • Nonlinearities between outcome \(Y\), covariates \(X\) and treatment \(D\)
  • Use of alternative data sources, e.g., unstructured datasets (text or images)

BUT ML models are good for prediction (not causal inference)

Causal inference: basics

 

  • Directed acyclic graphs (DAGs)
  • Experimental and observational data
  • Confounders
  • Identification
  • Nuisance variables/parameters

What is the effect of a variable \(D\) (e.g. having a college degree) on some response \(Y\) (e.g. income)?

Double machine learning

The problem

 

We consider the partially linear model

\[ \begin{align}\begin{aligned}y = \tau \, D + g(X) + \epsilon, & &\epsilon \sim \mathcal{N}(0,1),\\ D = m(X) + v, & &v \sim \mathcal{N}(0,1),\end{aligned}\end{align} \]

Furthermore, confounding is assumed to be complex and/or high-dimensional.

Goal: estimate the causal parameter \(\tau\)
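As a concrete reference point, the partially linear model can be simulated directly. The particular choices of \(g\), \(m\), \(\tau\) and the dimensions below are illustrative assumptions for the sketch, not part of the model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 5
tau = 1.5  # true causal parameter (arbitrary value for the sketch)

X = rng.normal(size=(n, p))

def g(X):  # nonlinear outcome nuisance g(X) (illustrative choice)
    return np.sin(X[:, 0]) + X[:, 1] ** 2

def m(X):  # treatment nuisance m(X) (illustrative choice)
    return 0.5 * np.tanh(X[:, 0]) + 0.25 * X[:, 2]

D = m(X) + rng.normal(size=n)            # D = m(X) + v,   v ~ N(0, 1)
y = tau * D + g(X) + rng.normal(size=n)  # y = tau*D + g(X) + eps,  eps ~ N(0, 1)
```

Only \(y\), \(D\) and \(X\) are observed; \(g\), \(m\) and \(\tau\) must be recovered from the data.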

A naive approach

 

To estimate \(\tau\) in

\[y = \tau \, D + g(X) + \epsilon\]

  1. Train a LASSO-regularized model for \(y\) on \(D\) and \(T(X)\) (a dictionary of transformations of \(X\)). Select the \(T(X)\) variables whose coefficients are \(\neq 0\); call these \(X_y\)
  2. Fit a linear model of \(y\) on \(D\) and \(X_y\), and use conventional inference procedures on \(\hat{\tau}\)

The estimate suffers from regularization bias. See Chernozhukov et al. (2018).

 

Understanding regularization bias

The reason that the naive estimator does not perform well is that it only selects controls that are strong predictors of the outcome, thereby omitting weak predictors of the outcome. However, weak predictors of the outcome could still be strong predictors of \(D\), in which case dropping these controls results in a strong omitted variable bias.
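A minimal numerical illustration of this mechanism, with made-up coefficients: the control \(x\) is a weak predictor of the outcome but a strong predictor of \(D\), so omitting it shifts the estimated effect away from the true value of 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
x = rng.normal(size=n)                       # weak predictor of y, strong of D
D = 2.0 * x + rng.normal(size=n)             # x strongly drives treatment
y = 1.0 * D + 0.1 * x + rng.normal(size=n)   # x only weakly drives the outcome

# OLS slope on D with the control included vs. omitted (via lstsq)
Zfull = np.column_stack([np.ones(n), D, x])
tau_full = np.linalg.lstsq(Zfull, y, rcond=None)[0][1]

Zshort = np.column_stack([np.ones(n), D])
tau_short = np.linalg.lstsq(Zshort, y, rcond=None)[0][1]

print(round(tau_full, 3), round(tau_short, 3))  # tau_short is biased upward
```

With these coefficients the omitted-variable bias is \(0.1 \cdot \mathrm{Cov}(D,x)/\mathrm{Var}(D) = 0.1 \cdot 2/5 = 0.04\), so the short regression recovers roughly 1.04 instead of 1.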

Frisch-Waugh-Lovell

\[y=\beta_0+\beta_1 x_1+\beta_2 x_2 + \beta_3 x_3+\epsilon\]

Code
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

# Generate synthetic data
np.random.seed(42)  # For reproducibility

n_samples = 1000

X1 = np.random.normal(0, 1, n_samples)
X2 = np.random.normal(0, 0.5, n_samples)
X3 = np.random.normal(0, 0.5, n_samples)

# Generate outcome Y as a linear combination of X1, X2 and X3 plus noise
Y = 2 * X1 + 5 * X2 + 0.5 * X3 + np.random.normal(0, 1, n_samples)

# Create a pandas DataFrame
data = pd.DataFrame({'X1': X1, 'X2': X2, 'X3': X3, 'Y': Y})

# Multivariate regression

reg = sm.OLS(Y, sm.add_constant(data[['X1','X2','X3']])).fit()
print(reg.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.900
Model:                            OLS   Adj. R-squared:                  0.899
Method:                 Least Squares   F-statistic:                     2972.
Date:                Tue, 01 Apr 2025   Prob (F-statistic):               0.00
Time:                        12:29:31   Log-Likelihood:                -1443.4
No. Observations:                1000   AIC:                             2895.
Df Residuals:                     996   BIC:                             2914.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0145      0.033     -0.445      0.656      -0.078       0.049
X1             1.9832      0.033     59.716      0.000       1.918       2.048
X2             4.8865      0.065     74.953      0.000       4.759       5.014
X3             0.5445      0.066      8.240      0.000       0.415       0.674
==============================================================================
Omnibus:                        2.695   Durbin-Watson:                   2.037
Prob(Omnibus):                  0.260   Jarque-Bera (JB):                2.349
Skew:                           0.000   Prob(JB):                        0.309
Kurtosis:                       2.763   Cond. No.                         2.05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

\[y-(y \sim x_2,x_3)= \beta_0 + \beta_1 (x_1 - (x_1 \sim x_2,x_3) ) + \epsilon\]

Code
# Frisch-Waugh-Lovell with linear first-stage regressions

# First-stage regressions (linear)
reg_X1_X2X3 = LinearRegression().fit(data[['X2', 'X3']], data['X1'])
X1_resid = data['X1'] - reg_X1_X2X3.predict(data[['X2', 'X3']])

reg_Y_X2X3 = LinearRegression().fit(data[['X2', 'X3']], data['Y'])
Y_resid = data['Y'] - reg_Y_X2X3.predict(data[['X2', 'X3']])

# Second-stage regression (linear)
reg_resid_resid = sm.OLS(Y_resid, sm.add_constant(X1_resid)).fit()

# Summary
print(reg_resid_resid.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.782
Model:                            OLS   Adj. R-squared:                  0.781
Method:                 Least Squares   F-statistic:                     3573.
Date:                Tue, 01 Apr 2025   Prob (F-statistic):               0.00
Time:                        12:29:31   Log-Likelihood:                -1443.4
No. Observations:                1000   AIC:                             2891.
Df Residuals:                     998   BIC:                             2901.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       8.674e-18      0.032   2.67e-16      1.000      -0.064       0.064
X1             1.9832      0.033     59.776      0.000       1.918       2.048
==============================================================================
Omnibus:                        2.695   Durbin-Watson:                   2.037
Prob(Omnibus):                  0.260   Jarque-Bera (JB):                2.349
Skew:                           0.000   Prob(JB):                        0.309
Kurtosis:                       2.763   Cond. No.                         1.02
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

 

Let \(y\) be the outcome and \(x_1\) the variable of interest. The coefficient on \(x_1\) can be derived from a two-stage process:

  • Regress \(y\) and \(x_1\) on the nuisance variables \(x_2,\, x_3\), and generate predictions for both
  • Regress \(\tilde{y}\) (the residuals of \(y\)) on \(\tilde{x}_1\) (the residuals of \(x_1\))

\[y-E[y|x_2,x_3]= \beta_0 + \beta_1 (x_1 - E[x_1|x_2,x_3]) + \epsilon\] \[\tilde{y}=\beta_0+\beta_1 \tilde{x}_1+\epsilon\]

Combining FWL with flexible regression models

Back to the partially linear model

\[ \begin{align}\begin{aligned}y = \tau \, D + g(X) + \epsilon, & &\epsilon \sim \mathcal{N}(0,1),\\ D = m(X) + v, & &v \sim \mathcal{N}(0,1),\end{aligned}\end{align} \]

  • We can draw on the flexibility of machine learning models to capture interactions and non-linearities when estimating \(Y\) and \(D\), and to handle a high-dimensional feature space
  • Upside: no need to make parametric assumptions about the relationship between the covariates and the outcome/treatment
  • Provided we don’t have unobserved confounders, we can recover the parameter of interest using the following procedure:

\[y-\hat{g}_y(X) = \tau \, (D-\hat{m}_D(X))+\epsilon\]

Using machine learning, one gains flexibility: learners can capture complicated functional forms in the nuisance relationships. But that flexibility also introduces the risk of overfitting.

Cross-fitting

Residuals are generated using out-of-fold predictions: the conditional means for each observation are predicted by models trained on folds that exclude it.
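The mechanics of out-of-fold prediction can be sketched for the treatment equation; the learner and data-generating process below are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
X = rng.normal(size=(600, 3))
D = np.tanh(X[:, 0]) + rng.normal(size=600)  # D = m(X) + v

# Out-of-fold predictions: the model that predicts fold k never saw fold k
d_hat = np.empty_like(D)
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train_idx], D[train_idx])
    d_hat[test_idx] = model.predict(X[test_idx])

v_hat = D - d_hat  # cross-fitted treatment residuals
```

Because each observation's prediction comes from a model that never saw it, the residuals do not absorb that observation's own noise.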

Double machine learning

 

Naive inference based on a direct application of machine learning methods to estimate the causal parameter is generally invalid. There are two sources of bias: regularization bias and overfitting bias.

Double machine learning embeds two mechanisms to avoid these biases:

  1. Orthogonalization (residual on residual regression) addresses regularization bias
  2. Cross-fitting addresses overfitting bias

Double machine learning: ingredients

 

  1. Use machine learning algorithms to obtain flexible estimates of the conditional means \(g\) and \(m\)

  2. Use sample splitting (k-fold cross-fitting) for predicting conditional means and computing residuals (for outcome and treatment)

  3. Run a residual on residual regression to retrieve the causal effect

Main result for the double machine learning estimator: unbiasedness and approximate normality

Double machine learning: practical questions

 

  • Simulation results show that the quality of causal estimates depends on the predictive performance of the nuisance learners, so the choice of learners and tuning parameters is crucial

  • Recommended diagnostics:

    • Monitor the combined loss of the nuisance learners
    • Report the range of causal estimates obtained under different learners/tuning parameters
    • Compare against benchmark models